Multiple Linear Regression OLS


Most machine learning algorithms use gradient descent to learn the optimal parameters.

However, in addition to gradient descent, linear regression can model data using another technique called ordinary least squares (OLS).

Ordinary Least Squares (OLS):

  • It is a deterministic algorithm: run multiple times on the same data, it always produces the same weights.

  • It finds the optimal solution in closed form (provided \(X^{T}X\) is invertible), with no iterative training.

  • The solution is mathematically framed as follows:

\[\large\textcolor{#2A7B9B}{\theta} = (X^{T}X)^{-1}X^{T}y\]
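The closed-form solution above can be sketched directly in NumPy. The data below (`X`, `true_theta`, `y`) is a hypothetical, noise-free example made up for illustration:

```python
import numpy as np

# Minimal sketch of the closed-form OLS solution on synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))         # n=100 samples, m=3 features
true_theta = np.array([2.0, -1.0, 0.5])
y = X @ true_theta                    # noise-free targets for illustration

# theta = (X^T X)^{-1} X^T y
theta = np.linalg.inv(X.T @ X) @ X.T @ y
print(theta)                          # recovers true_theta (no noise added)
```

Because the targets are noise-free, the recovered weights match `true_theta` up to floating-point precision.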

The sections below present two mathematical derivations of this result.

Proof #1: Using Matrix Manipulation

Step 1: With OLS, the idea is to find the set of parameters (\(\theta\)) such that:

\(\displaystyle\textcolor{#C224D1}\Rightarrow y = X\theta\)

where,

  • \(X\): input data with dimensions (\(n,m\)).

  • \(\theta\): parameters with dimensions (\(m,1\)).

  • \(y\): output data with dimensions (\(n,1\)).

  • \(n\): number of samples.

  • \(m\): number of features.

Step 2: Ideally, we would determine the parameter matrix \(\theta\) by multiplying both sides of the equation with the inverse of \(X\), as shown below:

\(\displaystyle\textcolor{#C224D1}\Rightarrow \theta = X^{-1}y\)

But that will only work if \(X\) is a square matrix (and has a non-zero determinant).
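This restriction is easy to see in practice. In the hypothetical example below, `X` is a tall (non-square) matrix, and NumPy refuses to invert it:

```python
import numpy as np

# A non-square X has no ordinary inverse; np.linalg.inv rejects it outright.
X = np.ones((4, 2))                   # n=4 samples, m=2 features
try:
    np.linalg.inv(X)
except np.linalg.LinAlgError as e:
    print("cannot invert:", e)
```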


Step 3: To resolve this, first, we multiply with the transpose of \(X\) i.e., \(X^T\) on both sides, as shown below:

\(\displaystyle\textcolor{#C224D1}\Rightarrow X^{T}X\theta = X^{T}y\)

This makes the product \(X^{T}X\) a square matrix of dimensions (\(m,m\)).


The obtained matrix, being square, can be inverted (provided it is non-singular).

Step 4: Lastly, we multiply both sides by \((X^{T}X)^{-1}\) to isolate \(\theta\):

\[\large\displaystyle\textcolor{#2A7B9B}\theta = (X^{T}X)^{-1}X^{T}y\]
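As a sanity check on the result of Proof #1, the normal-equation solution should agree with NumPy's built-in least-squares solver. The data here is hypothetical:

```python
import numpy as np

# Compare the closed-form normal-equation solution against np.linalg.lstsq.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))          # n=50 samples, m=4 features
y = rng.normal(size=50)

theta_ols = np.linalg.inv(X.T @ X) @ X.T @ y
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(theta_ols, theta_lstsq))  # True
```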

Proof #2: Using Calculus

Step 1: Essentially, the goal of linear regression is to minimize the squared error:

\(\displaystyle\textcolor{#C224D1}\Rightarrow L = ||y - X\theta||^{2}\)

Thus, we can use calculus to find the parameters \(\theta\) that minimize the squared error.

Step 2: Simplify the squared norm above:

\(\displaystyle\textcolor{#C224D1}\Rightarrow L = (y - X\theta)^T(y - X\theta)\)

Expanding the expression, we get:

\(\displaystyle L = y^{T}y - y^{T}X\theta - (X\theta)^Ty + (X\theta)^TX\theta\)

Now note:

  • \(y^T X\theta\) and \((X\theta)^Ty\) are scalars (each has dimensions \(1,1\)).
  • Scalars are equal to their transpose, so \((X\theta)^Ty = ((X\theta)^Ty)^T = y^T X\theta\).

So we can rewrite:

\(\displaystyle\textcolor{#C224D1}\Rightarrow L = y^Ty - 2y^T X\theta + \theta^T X^T X \theta\)

Step 3: Differentiate the above expression with respect to \(\theta\) and simplify:

\(\displaystyle\textcolor{#C224D1}\Rightarrow \frac{dL}{d\theta} = -2X^{T}y + 2X^TX\theta\)

since:

  • \(y^Ty\) : constant (independent of \(\theta\)), derivative = 0.

  • \(-2y^T X\theta\) : this is linear in \(\theta\).

    • (Numerator layout) Derivative w.r.t. \(\theta\): \(-2X^T y\).
  • \(\theta^T X^T X \theta\) : quadratic form.

    • (Numerator layout) Derivative w.r.t. \(\theta\): \(2X^T X \theta\) (since \(X^T X\) is symmetric).

| Operation (derivative w.r.t. \(x\)) | Numerator layout (result is a column vector) | Denominator layout (result is a row vector) |
|---|---|---|
| Linear form: \(a^T x\) | \(\frac{\partial(a^T x)}{\partial x} = a\) | \(\frac{\partial(a^T x)}{\partial x} = a^T\) |
| Quadratic form: \(x^T A x\) | \(\frac{\partial(x^T A x)}{\partial x} = 2Ax\) (for symmetric \(A\)) | \(\frac{\partial(x^T A x)}{\partial x} = 2x^T A\) (for symmetric \(A\)) |
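The analytic gradient from Step 3 can be verified numerically. The sketch below (on hypothetical data) compares \(-2X^{T}y + 2X^{T}X\theta\) against central finite differences of \(L = \|y - X\theta\|^2\):

```python
import numpy as np

# Check the analytic gradient -2 X^T y + 2 X^T X theta against
# central finite differences of L(theta) = ||y - X theta||^2.
rng = np.random.default_rng(2)
X = rng.normal(size=(20, 3))
y = rng.normal(size=20)
theta = rng.normal(size=3)

L = lambda t: np.sum((y - X @ t) ** 2)
analytic = -2 * X.T @ y + 2 * X.T @ X @ theta

eps = 1e-6
numeric = np.array([
    (L(theta + eps * np.eye(3)[i]) - L(theta - eps * np.eye(3)[i])) / (2 * eps)
    for i in range(3)
])
print(np.allclose(analytic, numeric, atol=1e-4))  # True
```

Since \(L\) is quadratic in \(\theta\), central differences match the analytic gradient up to floating-point rounding.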

Step 4: Set the derivative to zero:

\(\displaystyle\textcolor{#C224D1}\Rightarrow -2X^{T}y + 2X^TX\theta = 0\)

After rearranging, we get:

\(\displaystyle\textcolor{#C224D1}\Rightarrow X^TX\theta = X^{T}y\)

Step 5: Multiply both sides by \((X^{T}X)^{-1}\) to obtain the OLS solution:

\[\displaystyle \large \textcolor{#2A7B9B}\theta = (X^{T}X)^{-1}X^{T}y\]
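A practical note: forming the explicit inverse \((X^{T}X)^{-1}\) is usually avoided in code; solving the normal equations \(X^{T}X\theta = X^{T}y\) as a linear system is more numerically stable. A sketch on hypothetical data:

```python
import numpy as np

# Solve the normal equations X^T X theta = X^T y as a linear system,
# rather than forming the explicit inverse.
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 5))
y = rng.normal(size=200)

theta = np.linalg.solve(X.T @ X, X.T @ y)
theta_inv = np.linalg.inv(X.T @ X) @ X.T @ y   # same result, less stable
print(np.allclose(theta, theta_inv))           # True
```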

And that’s how we get to the OLS solution.